
A recap of previous courses

Sufficient Statistics

We define a sufficient statistic as a statistic that conveys exactly the same information about the parameter as the entire sample.

Fisher-Neyman Factorization Theorem: $T(x)$ is a sufficient statistic for the parameter $\theta$ in the parametric model $p(x|\theta)$ if and only if $p(x|\theta) = h(x)\,g_{\theta}(T(x))$ for some functions $h(x)$ (which does not depend on $\theta$) and $g_{\theta}(T(x))$.
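
As a standard worked example of the factorization (not from the notes above): for $n$ i.i.d. Bernoulli($\theta$) observations $x = (x_1, \dots, x_n)$,

$$p(x|\theta) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \underbrace{1}_{h(x)} \cdot \underbrace{\theta^{T(x)}(1-\theta)^{n-T(x)}}_{g_\theta(T(x))}, \qquad T(x) = \sum_{i=1}^n x_i,$$

so the number of successes $T(x)$ is a sufficient statistic for $\theta$.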

Exponential Families: $p(x|\eta) = h(x) \exp\left(\eta^T T(x) - A(\eta)\right)$, where (a numerical Bernoulli check follows this list):

  • $T(x)$ is a sufficient statistic
  • $\eta$ is the natural parameter
  • $A(\eta)$ is the log-partition function
  • $h(x)$ is the base (carrier) measure
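
To make the pieces concrete, here is a minimal numerical sketch (my own toy check, assuming only NumPy): the Bernoulli($p$) distribution is an exponential family with natural parameter $\eta = \log\frac{p}{1-p}$, sufficient statistic $T(x) = x$, log-partition function $A(\eta) = \log(1 + e^\eta)$, and base measure $h(x) = 1$.

```python
import numpy as np

def bernoulli_pmf(x, p):
    """Standard Bernoulli pmf p^x (1-p)^(1-x)."""
    return p**x * (1 - p)**(1 - x)

def bernoulli_exp_family(x, p):
    """The same pmf written as h(x) * exp(eta * T(x) - A(eta))."""
    eta = np.log(p / (1 - p))      # natural parameter
    T = x                          # sufficient statistic
    A = np.log(1 + np.exp(eta))    # log-partition function
    h = 1.0                        # base (carrier) measure
    return h * np.exp(eta * T - A)

for p in (0.2, 0.5, 0.9):
    for x in (0, 1):
        assert np.isclose(bernoulli_pmf(x, p), bernoulli_exp_family(x, p))
print("exponential-family form matches the Bernoulli pmf")
```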

Decision Theory

Estimating $p(x, c)$ from training data is an example of inference.

In a decision problem, the discriminant rule is to assign each $x$ to the class $j$ that minimizes the expected loss $\sum_k L_{kj}\,p(C_k|x)$, i.e. the decision regions are $R_j = \{x : \sum_k L_{kj}\,p(C_k|x) < \sum_k L_{ki}\,p(C_k|x)\ \ \forall i \ne j\}$.
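
A minimal sketch of this rule (the loss matrix and posterior below are my own toy numbers): given posterior probabilities $p(C_k|x)$ and a loss matrix $L_{kj}$ (the loss of deciding class $j$ when the true class is $k$), pick the class with the smallest expected loss.

```python
import numpy as np

# loss matrix L[k, j]: loss incurred by deciding class j when the true class is k
# (hypothetical numbers: deciding class 0 when the truth is class 1 is ten times worse)
L = np.array([[0.0, 1.0],
              [10.0, 0.0]])

def decide(posterior, L):
    """Return the class j minimizing the expected loss sum_k L[k, j] * p(C_k|x)."""
    expected_loss = posterior @ L       # one expected loss per candidate decision j
    return int(np.argmin(expected_loss))

posterior = np.array([0.8, 0.2])        # p(C_0|x), p(C_1|x)
print(decide(posterior, L))             # prints 1: the asymmetric loss overrides the MAP choice
```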

For regression problems, we want to minimize the expected loss $E[L] = \int\!\!\int L(y(x), t)\,p(x, t)\,dx\,dt$, where $L(y(x), t)$ is the loss function. The least-squares loss $L(y(x), t) = (y(x) - t)^2$ leads to the decomposition $E[L] = \int\!\!\int (y(x) - E[t|x])^2\,p(x,t)\,dx\,dt + \int\!\!\int (E[t|x] - t)^2\,p(x,t)\,dx\,dt$. Only the first term depends on $y(x)$, so the expected loss is minimized by choosing $y(x) = E[t|x]$ (a numerical check follows the bullet below).

  • The second term is the variance of $t$ given $x$, averaged over $x$; it does not depend on $y(x)$ and represents the irreducible noise.
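
A quick Monte Carlo sanity check of this decomposition (my own toy setup, assuming $t = \sin(x) + \varepsilon$ with Gaussian noise, so $E[t|x] = \sin(x)$): for any predictor $y(x)$, the expected squared loss is its squared distance from $E[t|x]$ plus the noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(-np.pi, np.pi, n)
noise_sd = 0.3
t = np.sin(x) + rng.normal(0.0, noise_sd, n)     # E[t|x] = sin(x)

def expected_loss(y):
    """Monte Carlo estimate of E[(y(x) - t)^2]."""
    return np.mean((y(x) - t) ** 2)

for name, y in [("optimal y(x) = E[t|x] = sin(x)", np.sin),
                ("biased  y(x) = 0.8*sin(x)     ", lambda u: 0.8 * np.sin(u))]:
    first_term = np.mean((y(x) - np.sin(x)) ** 2)   # depends on the choice of y
    second_term = noise_sd ** 2                     # variance of t|x, independent of y
    print(f"{name}: E[L] ~ {expected_loss(y):.4f} ~ {first_term + second_term:.4f}")
```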

Multivariate Gaussian

Let $\mu \in \R^m$ and let $\Sigma$ be a symmetric positive-definite $m \times m$ matrix. We write $X \sim N_m(\mu, \Sigma)$ if the pdf of $X$ is given by $f(x) = \frac{1}{(2\pi)^{m/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$. The multivariate Gaussian has several useful properties (a numerical check follows the list below):

  • each marginal $X_i$ is Gaussian with mean $\mu_i$ and variance $\sigma_i^2 = \Sigma_{ii}$.
  • the conditional distribution of $X_j$ given $X_i$ is Gaussian with mean $\mu_j + \Sigma_{ji}\Sigma_{ii}^{-1}(x_i - \mu_i)$ and variance $\Sigma_{jj} - \Sigma_{ji}\Sigma_{ii}^{-1}\Sigma_{ij}$.
  • $X_i$ and $X_j$ ($i \ne j$) are independent if and only if $\Sigma_{ij} = 0$.
  • $X_i \perp X_j \mid X_k \iff \Sigma_{ij} = \Sigma_{ik}\Sigma_{kk}^{-1}\Sigma_{kj} \iff (\Sigma^{-1})_{ij} = 0$, where $X_k$ collects all the remaining variables.
  • if $Y = AX + b$, where $A \in \R^{n \times m}$ and $b \in \R^n$ is a vector, then $Y \sim N_n(A\mu + b, A\Sigma A^T)$.
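
A small NumPy check of the conditioning and linear-transform properties (my own toy parameters, comparing the analytic formulas above against Monte Carlo estimates):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=500_000)

# conditional X_2 | X_1 = x1: mean mu_2 + S21 S11^-1 (x1 - mu_1), var S22 - S21 S11^-1 S12
x1 = 2.0
cond_mean = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (x1 - mu[0])
cond_var = Sigma[1, 1] - Sigma[1, 0] / Sigma[0, 0] * Sigma[0, 1]
near = np.abs(X[:, 0] - x1) < 0.02                 # crude empirical conditioning on X_1 ~ x1
print(cond_mean, X[near, 1].mean())                # analytic vs. empirical conditional mean
print(cond_var, X[near, 1].var())                  # analytic vs. empirical conditional variance

# linear transform: Y = AX + b is Gaussian with mean A mu + b and covariance A Sigma A^T
A = np.array([[1.0, 1.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0])
Y = X @ A.T + b
print(A @ mu + b, Y.mean(axis=0))                  # analytic vs. empirical mean
print(A @ Sigma @ A.T, np.cov(Y.T))                # analytic vs. empirical covariance
```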

Bayesian Inference

We use Bayesian inference for the latent variable model $p(x, z) = p(z)\,p(x|z)$, where:

  • $x$ are the observations or data
  • $z$ are the unobserved or latent variables
  • $p(z)$ is the prior distribution of $z$
  • $p(x|z)$ is the likelihood of $x$ given $z$
  • $p(z|x)$ is the posterior distribution of $z$ given $x$, i.e. the conditional distribution of the unobserved variables given the observed data. By Bayes' theorem, $p(z|x) = \frac{p(x|z)\,p(z)}{p(x)}$, where $p(x) = \int p(x|z)\,p(z)\,dz$ is the marginal likelihood (a small conjugate example follows this list).
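
A minimal conjugate sketch (my own toy example): with a Beta($a, b$) prior on $z$, the success probability of a Bernoulli likelihood, the posterior after observing $x_1, \dots, x_n$ is Beta($a + \sum_i x_i,\ b + n - \sum_i x_i$), and a grid approximation of $p(z|x) = p(x|z)p(z)/p(x)$ agrees with this closed form.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
a, b = 2.0, 2.0                      # Beta prior hyperparameters for z = P(x_i = 1)
x = rng.binomial(1, 0.7, size=30)    # observed Bernoulli data
s, n = x.sum(), len(x)

# closed-form conjugate posterior: Beta(a + sum(x), b + n - sum(x))
posterior = beta(a + s, b + n - s)

# grid approximation of p(z|x) = p(x|z) p(z) / p(x)
z = np.linspace(1e-6, 1 - 1e-6, 2001)
prior_pdf = beta(a, b).pdf(z)
likelihood = z**s * (1 - z)**(n - s)          # p(x|z) for i.i.d. Bernoulli data
unnorm = likelihood * prior_pdf
dz = z[1] - z[0]
post_grid = unnorm / (unnorm.sum() * dz)      # normalise by the marginal likelihood p(x)

# maximum pointwise gap between the grid posterior and the analytic one (small; discretisation only)
print(np.max(np.abs(post_grid - posterior.pdf(z))))
```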